ABSTRACT

Computerized adaptive testing procedures (CATPs) 
based on the graded response model (GRM) of F. Samejima (1969) and 
the partial credit model (PCM) of G. Masters (1982) were developed 
and compared. Both programs used maximum likelihood estimation of 
ability, and item selection was conducted on the basis of 
information. Two simulated data sets, one with 1,000 simulated 
examinees and one with 500 simulated examinees, were generated 
according to a linear factor analytic model. Both contained responses to 180 
five-alternative items, of which 55 were retained for 997 simulated 
examinees with infit statistics between -3.0 and 3.0. The MULTILOG 
calibration program of D. J. Thissen (1988) was used to obtain item 
parameter estimates for both models from the data set. It was 
expected that using a data set fitted to the PCM would result 
in no differences between PCM and GRM CATPs. However, the GRM CATPs 
provided more accurate ability estimates than did the PCM CATPs, and the 
estimation was considered adequate. Reasons for difficulties with the 
PCM are discussed. Two tables and five graphs present study 
data. (SLD) 

OBJECTIVES

The objectives of this research were (a) to develop and implement a computerized
adaptive testing (CAT) procedure based on Samejima's (1969) graded response (GR) model
and Masters' (1982) partial credit (PC) model, and (b) to compare the GR-based CAT
performance in ability estimation with that of the PC-based CAT.

MODEL DESCRIPTIONS

The two polychotomous models, GR and PC, are appropriate for items with ordered
responses, such as aptitude and achievement test items whose alternatives are inherently
ordered or have been ordered according to degree of correctness (e.g., through partial
credit scoring). In addition, attitude questionnaires and ratings data may also be fitted
by either model.

The GR model is a direct extension of the two-parameter model. As a result, the GR 
model contains a parameter which allows an assessment of an item's capacity to 
discriminate among examinees. In the GR model the examinee responses to item i are 
categorized into m_i + 1 categories, where higher categories indicate more of θ and m_i is the
number of category boundaries. Associated with each category of item i is a category
score, x_i, with values 0..m_i. The GR model may be expressed as:

$$P_{x_i}(\theta) = \frac{e^{D a_i (\theta - b_{x_i})}}{1 + e^{D a_i (\theta - b_{x_i})}} \qquad (1)$$

where θ is the latent trait, a_i is the discrimination parameter for item i, b_{x_i} is the
difficulty parameter for category score x_i for item i, and the scaling constant D equals
1.702. P_{x_i}(θ) is the probability of the examinee responding in category score x_i or
higher for a given item; the probability of responding in the lowest category (i.e., P_0(θ)) or
higher is defined as 1.0. For instance, for an item with four response categories P_2(θ) is
the probability of responding in categories 2 or 3 rather than in categories 0 or 1.
Because P_{x_i}(θ) is the probability of responding in x_i or higher, the probability of responding
in a particular category equals the difference between cumulative probabilities for
adjacent categories (e.g., p_2(θ) = P_2(θ) - P_3(θ)). When an item consists of two categories
(correct and incorrect), the GR model reduces to the two-parameter model.
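
As an illustration of the GR model described above, the following is a minimal sketch (not the authors' program) that computes the cumulative and category probabilities for a single item. The function names and the example parameter values are hypothetical; only the scaling constant D = 1.702 and the definitions of the probabilities come from the text.

```python
import math

D = 1.702  # scaling constant given in the text


def gr_cumulative(theta, a, b):
    """Cumulative probabilities P_0..P_m of responding in category x or higher.

    a is the item discrimination; b lists the m category boundary
    difficulties b_1..b_m in ascending order (hypothetical values in the demo).
    """
    probs = [1.0]  # P_0 is defined as 1.0
    for bx in b:
        probs.append(1.0 / (1.0 + math.exp(-D * a * (theta - bx))))
    return probs


def gr_category_probs(theta, a, b):
    """Category probabilities p_x = P_x - P_(x+1), taking P_(m+1) as 0."""
    cum = gr_cumulative(theta, a, b) + [0.0]
    return [cum[x] - cum[x + 1] for x in range(len(b) + 1)]


# A five-category item (four boundaries) evaluated at theta = 0.
probs = gr_category_probs(theta=0.0, a=1.3, b=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

With a single boundary the second category probability is the two-parameter logistic curve, which mirrors the reduction noted in the text.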

In contrast to the GR model, the PC model provides a direct expression of the 
probability of an examinee with ability θ responding in a particular category. In the PC
model the examinee-item interaction is modeled as:

$$P_{x_i}(\theta) = \frac{\exp\left[\sum_{j=0}^{x_i} (\theta - b_{ji})\right]}{\sum_{k=0}^{m_i} \exp\left[\sum_{j=0}^{k} (\theta - b_{ji})\right]} \qquad (2)$$

where θ is the latent trait and b_{ji} is the difficulty parameter of step j of item i, so that
b_{x_i} is the difficulty parameter of the step associated with
category score x_i of item i with m_i steps, where x_i = 1..m_i. A category score reflects
the number of successfully completed steps. A "step" is simply a stage required to
complete an item. For instance, the problem ((6/3)+2)^2 is considered to contain three
steps because there are three separate stages which must be completed (in a specific
order) to correctly answer the problem (i.e., step 1: 6/3, step 2: the addition of 2 to the
quotient, and step 3: the squaring of the quantity). For notational convenience Σ(θ - b_{ji})
where j = 0 is defined as being equal to zero.
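
The PC model described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' program; the function name and the step difficulties used in the demonstration are hypothetical.

```python
import math


def pc_category_probs(theta, steps):
    """Category probabilities p_0..p_m for one item under the PC model.

    steps holds the step difficulties b_1..b_m (hypothetical values in
    the demo). The j = 0 term of the sum is defined as zero, so the
    running sum starts at 0.0 for category score 0.
    """
    running = [0.0]
    for b in steps:
        running.append(running[-1] + (theta - b))
    numerators = [math.exp(s) for s in running]
    total = sum(numerators)
    return [n / total for n in numerators]


# A three-step item (category scores 0..3) evaluated at theta = 0.5.
print([round(p, 3) for p in pc_category_probs(0.5, [-1.0, 0.0, 1.2])])
```

With a single step the probability of category score 1 is the Rasch model, consistent with the reduction stated in the text.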

Because the PC model is an extension of the Rasch model it assumes that all items are 
equally good at discriminating among examinees. In addition, as a member of the Rasch 
family, the PC model's item and person parameters may be estimated on the basis of the 
existence of sufficient statistics. Specifically, an examinee's test score contains all the
information for estimating his or her ability and the items' difficulties may be estimated
from a simple count of the number of persons completing each "step" of an item. Unlike
the GR model, the PC model requires that the steps within an item be completed in
sequence, although the steps need not be equally difficult nor be ordered in terms of
difficulty. If an item consists of only two categories, then the PC model reduces to the 
Rasch model. 

Except for a few researchers (e.g., Dodd, Koch, & De Ayala, 1989; De Ayala, 1989; 
Sympson, 1986) CAT research has been primarily concerned with dichotomous item 
response theory (IRT) models. However, a number of exams are scored in a graded fashion. 
For example, statistics, mathematics, chemistry, and physics exams are typically graded 
by giving partial credit for some incorrect answers. It is reasonable and desirable (i.e., for
the acceptance of CAT applications in these areas) to expect CAT implementations in
these subjects to incorporate a graded scoring system. In addition, relative to a
dichotomous model-based CAT the use of a polychotomous model permits the use of the 
examinee's partial knowledge of the correct response for their ability estimation and 
should result in decreased test length; the existence of information in incorrect responses 
has been demonstrated in several studies (Levine & Drasgow, 1983; Thissen, 1976). 

METHOD 

Programs: Two CAT programs were written; one program was based on the PC model (called
the PC CAT), whereas the other was based on the GR model (GR CAT). Both programs used
maximum likelihood estimation of ability, and items were selected on the basis of
information. The adaptive testing simulation was terminated when either of two criteria
was met: a maximum of twenty items was reached or a predetermined standard
error of estimate (SEE) was obtained (SEE termination criteria of 0.10, 0.25, and 0.30 were
used). Previous work with polychotomous model-based CATs has shown that an SEE termination
criterion results in better CAT performance than does the minimum item information criterion (e.g., Dodd,
Koch, & De Ayala, 1989). The initial ability estimate for an examinee was the
population's mean.
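
The estimation and termination logic described above can be sketched as follows for the PC model. This is a hedged illustration, not the authors' code: the Newton-Raphson update, the tolerance, the bound used to declare nonconvergence, and all function names are assumptions. It relies on the standard result that, for the PC model, the first derivative of the log-likelihood is the observed score minus the expected score and the information is the variance of the category score.

```python
import math


def pc_probs(theta, steps):
    """PC category probabilities for step difficulties b_1..b_m."""
    running = [0.0]
    for b in steps:
        running.append(running[-1] + (theta - b))
    nums = [math.exp(s) for s in running]
    total = sum(nums)
    return [n / total for n in nums]


def pc_moments(theta, steps):
    """Expected category score and its variance (the item information)."""
    p = pc_probs(theta, steps)
    mean = sum(x * px for x, px in enumerate(p))
    var = sum(x * x * px for x, px in enumerate(p)) - mean ** 2
    return mean, var


def ml_ability(responses, items, start=0.0, tol=1e-4, max_iter=50, bound=6.0):
    """Newton-Raphson ML estimate; returns (theta_hat, see) or None.

    start=0.0 mirrors the population mean used as the initial estimate.
    tol, max_iter and bound are hypothetical settings.
    """
    theta = start
    for _ in range(max_iter):
        moments = [pc_moments(theta, steps) for steps in items]
        gradient = sum(x - m for x, (m, _) in zip(responses, moments))
        info = sum(v for _, v in moments)
        step = gradient / info
        theta += step
        if abs(theta) > bound:
            return None  # estimate is drifting off the scale: nonconvergent
        if abs(step) < tol:
            info = sum(pc_moments(theta, s)[1] for s in items)
            return theta, 1.0 / math.sqrt(info)
    return None


def should_stop(n_administered, see, see_criterion=0.25, max_items=20):
    """Stop at twenty items or once the SEE criterion is reached."""
    return n_administered >= max_items or (see is not None and see <= see_criterion)


items = [[-1.0, 0.0, 1.0], [-0.5, 0.5], [0.0, 1.0, 2.0]]  # hypothetical pool
print(ml_ability([2, 1, 1], items))
print(ml_ability([3, 2, 3], items))  # all top categories: no finite estimate
```

A response pattern made up entirely of the highest (or lowest) categories has no finite maximum likelihood estimate, which is one way a simulated examinee can end up as a nonconvergent case.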

Data: Two simulation data sets were generated according to a linear factor analytic model
(Wherry, Naylor, Wherry, & Fallis, 1965). Both data sets were unidimensional and
contained responses to 180 5-alternative items. One data set contained 1000 simulees
(randomly selected from a N(0,1) distribution) and was used for obtaining item parameter
estimates; this data set was called the calibration data set. The second data set (called the
CAT data set) consisted of responses from 500 simulees (randomly selected from a N(0,1) distribution)
to the same 180 items as the calibration data set; the z-values used for generating
responses were considered to be the simulees' true abilities (θ_T). The CAT data set was
used for the simulated CATs. The use of a linear factor analytic approach for generating
the data sets minimized any bias in favor of one IRT model or the other. All factor
loadings were uniformly high and ranged from 0.62 to 0.85. Further, the use of separate
data sets for calibration and CAT simulations minimized the capitalization on chance that
would result from using the same data set for both the calibration of the item pool and the CAT
simulations.
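
A rough sketch of generating ordered five-category responses from a one-factor linear model is given below. It is not the Wherry et al. (1965) procedure itself: the cut points that turn the continuous variable into category scores, the uniform draw of loadings within the reported 0.62 to 0.85 range, the seeds, and the function name are all assumptions made for illustration.

```python
import math
import random


def simulate_responses(n_simulees, loadings, cuts, seed=0):
    """Return (true abilities, response matrix) for a one-factor linear model.

    Each continuous response is loading * theta + unique part, with the
    unique variance chosen so the response has unit variance. The score
    is the number of (hypothetical) cut points exceeded: 0..len(cuts).
    """
    rng = random.Random(seed)
    thetas, data = [], []
    for _ in range(n_simulees):
        theta = rng.gauss(0.0, 1.0)  # true ability drawn from N(0, 1)
        row = []
        for lam in loadings:
            z = lam * theta + math.sqrt(1.0 - lam ** 2) * rng.gauss(0.0, 1.0)
            row.append(sum(z > c for c in cuts))
        thetas.append(theta)
        data.append(row)
    return thetas, data


# 180 items with loadings inside the range reported in the text.
loading_rng = random.Random(1)
loadings = [loading_rng.uniform(0.62, 0.85) for _ in range(180)]
thetas, data = simulate_responses(500, loadings, cuts=[-1.0, -0.3, 0.3, 1.0])
print(len(data), len(data[0]))
```

Generating a second matrix with a different seed gives a separate data set, which is the idea behind using distinct calibration and CAT data sets.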

MULTILOG (Thissen, 1988) was used to obtain item parameter estimates for both the
PC and GR models from the calibration data set. The use of a single calibration program 
for both models controlled for differences in the implementation of estimation algorithms 
when different calibration programs are used. Although the item parameter estimates 
used for the CAT simulations were obtained from MULTILOG, MSTEPS (Wright, Congdon, & 
Schultz, 1989) was used to obtain fit statistics for the PC model.

Analysis: The simulated 1000-examinee by 180-item calibration data set was fitted to the PC model.
Items which were found to fit the PC model were used to form an item pool for the PC CAT;
PC and GR item parameter estimates were obtained for this fitted set of items. In addition,
GR item parameters were estimated for the original 180-item set. The CAT simulations
were analyzed by comparing each CAT's estimated ability (θ̂) with θ_T. These comparisons
involved correlational analysis (Pearson product-moment and Spearman rank-order
correlation coefficients), standardized root mean squared differences (SRMSD),
standardized differences between means (SDM), and descriptive statistics. The
differences between θ̂ and θ_T were graphically examined. Further, descriptive statistics
on the number of items administered by each CAT were calculated and the relationship of
SEE to θ_T was also inspected.
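
The text does not give formulas for SRMSD and SDM. The sketch below shows one plausible reading, in which both indices are standardized by the pooled standard deviation of the estimated and true abilities; that standardization, and the function names, are assumptions rather than the authors' definitions.

```python
import math
import statistics


def pooled_sd(est, true):
    """Pooled standard deviation of the two sets of values (assumed standardizer)."""
    return math.sqrt((statistics.variance(est) + statistics.variance(true)) / 2.0)


def srmsd(est, true):
    """Root mean squared difference divided by the pooled SD."""
    squared = [(e - t) ** 2 for e, t in zip(est, true)]
    return math.sqrt(statistics.fmean(squared)) / pooled_sd(est, true)


def sdm(est, true):
    """Difference between means divided by the pooled SD.

    Positive values indicate overall overestimation of true ability.
    """
    return (statistics.fmean(est) - statistics.fmean(true)) / pooled_sd(est, true)


true_abilities = [-1.0, 0.0, 1.0]   # hypothetical values
estimates = [-0.5, 0.5, 1.5]        # every estimate is 0.5 too high
print(srmsd(estimates, true_abilities), sdm(estimates, true_abilities))
```

When every estimate is off by the same constant, the two indices coincide; SRMSD exceeds the absolute SDM once the errors vary from examinee to examinee.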

RESULTS

Calibrations 

Fifty-five items with weighted total fit statistics between -3.0 and 3.0 were retained 
for use with the PC CAT. Further, 997 simulees were found to have infit statistics 
between -3.0 and 3.0. Therefore, the PC calibration was performed on the 55-item pool (a.k.a.
the PC calibration data set) with 997 examinees. Item parameter estimates for the GR
model were obtained for both the 55-item pool and the original 180-item pool; the three
examinees identified as not fitting the PC model were retained for the GR calibrations. In
the following, the GR CAT using the 55-item pool will be referred to as the GR-55 CAT,
whereas GR-180 CAT will indicate the GR CATs with the 180-item pool. Dodd, Koch, and
De Ayala (1989) and Koch and Dodd (1989) have been successful in using item pools of
about this size in GR and PC CAT simulations, respectively. The 500 examinee/55-item
data set used for the CAT simulations will be referred to as the CAT data set.
Item Pools 

The PC 55-item pool had step difficulty estimates which ranged from -2.365 to 3.124, 
with a positively skewed distribution of difficulties for the first step difficulty, a 
negatively skewed distribution of difficulties for the last step difficulty, and more or less 
unimodal difficulty distributions for the second and third step difficulties. The GR 55- 
and 180-item pools had average discrimination estimates of 1.320 (median=1.300, 
standard deviation=0.102) and 1.467 (median=1.453, standard deviation=0.255),
respectively. The difficulty estimates for the GR 55-item pool ranged from -4.093 to 4.189
and from -4.527 to 4.924 for the GR 180-item pool. For all category scores in the GR 180-item
pool the difficulty estimates tended to be normal-like in distribution, whereas for the GR
55-item pool the distributions for the first and third difficulty estimates were positively
skewed and the distributions were rectangular-like for the second and fourth
difficulties. Given Urry's (1977) guidelines, the item pool for the GR CATs consisted of
desirable items. It would have been desirable to have items with step difficulties below
-2.365; however, the absence of these items was not problematic for the PC CAT. Figure 1
shows the total item pool information for both the GR-55 and the PC item pools; the
estimate of the information function for the GR-180 item pool was similar to, and about
twice that of, the GR-55 item pool. As can be seen, the PC 55-item pool provides greater
information than the GR-55 item pool for the approximate range -2.25 to 2.5. Because the
simulees' abilities were generated from a normal distribution, the majority of the
examinees had abilities within ±2.0 standard deviations about 0.0. The observed percent
of examinees with abilities greater than 2.0 or less than -2.0 was 14%, and only 1.4% of
the simulees had abilities outside the range -3.0 to 3.0.
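
Total item pool information of the kind discussed above is the sum of the item information functions. The sketch below uses Samejima's item information for the GR model and also shows maximum-information item selection; the pool, the function names, and the parameter values are hypothetical, and this is not the authors' program.

```python
import math

D = 1.702  # scaling constant given in the text


def gr_item_information(theta, a, b):
    """GR item information: sum over categories of (p_x')^2 / p_x."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-D * a * (theta - bx))) for bx in b] + [0.0]
    slope = [D * a * p * (1.0 - p) for p in cum]  # derivative of each cumulative curve
    info = 0.0
    for x in range(len(b) + 1):
        p_x = cum[x] - cum[x + 1]
        if p_x > 0.0:
            info += (slope[x] - slope[x + 1]) ** 2 / p_x
    return info


def pool_information(theta, pool):
    """Total information of a pool of (a, b) items at theta."""
    return sum(gr_item_information(theta, a, b) for a, b in pool)


def most_informative(theta, pool, used):
    """Index of the unused item with the greatest information at theta."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: gr_item_information(theta, *pool[i]))


pool = [(1.3, [-2.0, -1.0, 0.0, 1.0]), (1.5, [-0.5, 0.0, 0.5, 1.5]), (1.2, [0.5, 1.0, 2.0, 3.0])]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(pool_information(theta, pool), 3), most_informative(theta, pool, set()))
```

For a two-category item the function returns D²a²P(1 - P), the familiar two-parameter information function.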



Insert Figure 1 about here 

CAT Simulations 

For the PC CAT simulations the correlation coefficients between θ̂ and θ_T decreased
with increases in the SEE termination criterion. As can be seen from Table 1, all
correlation coefficients are equal to or above 0.93 and the corresponding scatterplots
showed strong linear associations. The correlation coefficients for the GR-180 CAT
simulations followed the same pattern as for the PC CAT simulations, albeit with slightly
higher values. In contrast, for the GR-55 CAT increases in the SEE termination criterion
had no effect on the correlation coefficients between θ̂ and θ_T. The linear relation between
θ̂ and θ_T as assessed by the Pearson product-moment correlation coefficient was slightly
higher for the GR-55 CAT than for the PC CAT, although the Spearman rank-order
coefficients were lower for the GR-55 CAT than those of the PC CAT for all SEE termination
criteria, except for the SEE termination criterion of 0.30. On the average, the GR CATs
administered slightly longer tests than did the PC CAT.

Insert Table 1 about here 

SRMSD provides an assessment of the accuracy of estimation across examinees, while
SDM assesses the overall bias between the θ̂s and θ_Ts. The SRMSD and SDM for the CATs
are presented in Table 2. As can be seen, regardless of whether the 55- or 180-item pool
was used, the SRMSDs for the GR CATs were approximately one-third those of the PC CATs.
This indicated that the GR CATs were providing ability estimates which were
comparatively more accurate than those of the PC CATs. On average, the GR-180 CATs' θ̂s
were very similar to the mean θ_T (mean θ_T = 0.076 for the 494/497 convergent cases). Similarly, the GR-55
CATs' average θ̂s were close to the average θ_T (mean θ_T = 0.076 for the 489/491 convergent cases).
Further, the SDMs for the GR CATs revealed a slight overall underestimation of θ_T. The
bias for the GR-55 and GR-180 CATs (SEE = 0.25) is graphically depicted in Figures 2 and
3, respectively. These figures are typical of the pattern exhibited by the other GR CATs.
As can be seen from these figures, the GR CATs had a tendency to overestimate θ_T > 1.0 and
to underestimate θ_T < -1.0.

Insert Table 2 about here 



Insert Figures 2 and 3 about here

The SDMs for the PC CATs showed that there was a strong tendency to overestimate θ_T.
This was also apparent from a comparison of the mean θ̂ and the average θ_T of 0.083 for all
500 simulees (for the 470 convergent cases the mean θ_T = -0.031). The relationship between the
(θ̂ - θ_T) difference and θ_T (Figure 4) for the PC CAT with SEE = 0.25 showed that there was a
tendency to overestimate throughout the ability scale; this pattern was typical of the other
two PC CATs. It was not surprising given the shape of the information function that, in
general, larger SEEs (e.g., SEE > 0.35) tended to be associated with high θ (e.g., θ > 3.00).

Insert Figure 4 about here 

Convergence 

The convergence rates for the GR CATs were 97.8% or higher. For the GR-180 CATs with
termination SEEs of 0.10 and 0.25, two of the nonconvergent cases were high ability
examinees (θ_T = 2.774 and 2.084), while the third case was a very low ability examinee (θ_T =
-3.025); the six nonconvergent cases for the GR-180 CAT (SEE = 0.30) had θ_Ts of 1.665,
1.779, 2.084, 2.774, -1.479, and -1.479. Similarly, the GR-55 CATs' nonconvergent cases
were distributed throughout the ability range. In contrast, the majority of the
nonconvergent cases for the PC CAT were associated with θ_T ≥ 2.0 simulees (convergence
rate = 94%); the nonconvergence was nonsymmetric. The three PC CATs were unable to
estimate the same 30 simulees and four of the 30 cases were examinees for which the GR-55
CAT was unable to obtain an ability estimate (θ_T = 1.784, θ_T = 2.576, θ_T = 2.774, θ_T
= 3.162). Figure 5 shows the relationship between the GR-55 CATs and the PC CAT
nonconvergent cases. Infit statistics calculated for the CAT data set revealed thirty-six
examinees with fit values greater than 2.0, only three of which were nonconvergent cases.

Insert Figure 5 about here
DISCUSSION 

Given the similarity in results for the GR-55 and GR-180 CATs, it appears that item 
pools smaller than are suggested for dichotomous model-based CATs can be used with GR
model-based CATs. It was expected that using a data set which was fitted to the PC model 
would result in no differences between the GR and PC CATs. However, despite this 
characteristic and the fact that the PC model provided more information for 86% of the 
examinees than did the GR model, the GR-55 CATs provided more accurate estimation than 
the PC CATs. In the authors' opinions the results of the GR CATs were acceptable. 

The fitting of the CAT data set to the PC model identified eight items which no longer 
fit the model (i.e., infit values greater than 3.0), although they had fit the PC calibration 
data set. The misfitting CAT data set items and the examinees were retained for the CAT
simulations because in a real-life implementation this information would only be
available post hoc. That is, after a CAT was operational and the misfit information had
been gathered, it would be difficult to justify to an examinee that he/she had to be
eliminated because on the basis of his/her performance on the adaptive test he/she was
found not to fit the CAT's IRT model. Conceivably, the misfitting items could be
eliminated from future use in the CAT, although the items would still have had an effect
on the examinees who had already been administered the tailored tests. Therefore, the
retention of misfitting items and examinees for the CAT simulations was consistent with
the procedures of a real-life CAT implementation. Further, given that only three of the
simulees did not fit the PC model, it does not appear that the PC CAT nonconvergent cases
were a result of simulees which did not fit the PC model. The role of the misfitting items
in the PC CAT convergence and bias is not known.

It may be speculated that some of the PC CATs' difficulties are a result of MULTILOG's
implementation of the PC model. That is, in MULTILOG PC parameter estimation requires
imposing triangular contrasts on Bock's (1972) nominal response (NR) model (cf. Thissen
& Steinberg, 1986). Imposing these triangular contrasts on the NR model is the logical
equivalent of making the a priori order assumption necessary for the PC model (Thissen,
1988; Masters & Wilson, 1988). In this regard, the calibration of the data showed that the
a of best fit for the PC model was 0.754, not a = 1.0 as the Rasch PC model assumes. As would
be expected given the differences in estimation techniques between MULTILOG and the
Rasch program MSTEPS, as well as the difference in the approach to fixing the scale's
origin, the programs' difficulty estimates were not equal. However, there was a very high
linear agreement between the two sets of estimates (r_{b1} = 0.989, r_{b2} = 0.977, r_{b3} = 0.986,
r_{b4} = 0.995). Given the similarity in the magnitudes of the item parameter estimates as
well as the above correlations, it does not appear that the results are due to MULTILOG's
implementation of the PC model.

Because the PC and GR CAT programs have been successfully used in previous studies 
(Koch & Dodd, 1989; Dodd, Koch & De Ayala, 1989) it is not likely that the CAT programs 
were at fault. A possible explanation for the PC CAT's difficulties may be the use of an 
infit criterion of ±3.0 for retaining items; a more conservative criterion may be required 
for the creation of PC item pools. Future research will investigate the relationship 
between the degree of fit of items to the PC model for inclusion in an item pool and PC CAT
ability estimation.

Table 1: Correlation coefficients between θ̂ and θ_T (mean θ_T = 0.083, σ = 1.024) and descriptive
statistics on CAT

                    Correlation           Average   SD      Mean     SD      Number of
CAT          SEE    Pearson   Spearman    θ̂         θ̂       NIA¹     NIA¹    convergent cases

GR-180 CAT   0.10   0.966     0.964        0.068    1.398   20.0     0.0     497
             0.25   0.961     0.958        0.024    1.379   14.396   1.229   497
             0.30   0.937     0.927       -0.010    1.391   10.215   1.073   494

GR-55 CAT    0.10   0.961     0.958       -0.001    1.481   20.0     0.0     489
             0.25   0.961     0.958       -0.001    1.481   20.0     0.0     489
             0.30   0.961     0.958       -0.004    1.472   17.179   1.073   491

PC CAT       0.10   0.959     0.973        0.957    1.068   20.0     0.0     470
             0.25   0.948     0.960        0.918    1.100   12.672   3.609   470
             0.30   0.933     0.940        0.925    1.110    9.306   3.983   470

¹NIA: number of items administered

Table 2: SRMSD and SDM for PC and GR CATs

CAT          SEE    SRMSD    SDM

GR-180 CAT   0.10   0.325    -0.012
             0.25   0.332    -0.049
             0.30   0.381    -0.077

GR-55 CAT    0.10   0.373    -0.065
             0.25   0.373    -0.065
             0.30   0.370    -0.064

PC CAT       0.10   1.028     0.984
             0.25   0.973     0.929
             0.30   0.982     0.928

[Figure 2. Difference between estimate & true ability, GR-55 CAT, SEE=0.25. Horizontal axis: theta.]

[Figure 5. Nonconvergent Cases for PC and GR-55 CATs. Series: PC CAT (SEE=0.30); GR-55 CAT (SEE=0.30, SEE=0.25, SEE=0.10). Horizontal axis: theta.]